For profiling I used TAU (https://tau.uoregon.edu/), since I wanted to try it out. TAU can report how much time is spent in each function, and I used this to report only the time taken by each call to count_sort (or qsort).
For statistical significance each timing should be repeated at least 10 times, with the mean and standard deviation taken into consideration. I had part of the analysis prepared for this, but not all simulations finished in time, so I reverted to the single-run case.
Python modules used in analysis:
In [1]:
%matplotlib inline
from IPython.core.display import HTML
from collections import defaultdict
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from analysis_scripts import prepare_panel
In the i-loop case $i, j, \text{count}$ can be private.
$a$ and $\text{temp}$ should be shared: $a$ is read-only but can be big (so making it private would add substantial overhead), and $\text{temp}$ is written by all threads, but each iteration writes to a distinct position (so there are no race conditions).
$n$ is a constant in this context, but it needs to be shared; otherwise each thread would get its own uninitialized copy and produce errors.
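As a reference for the discussion above, here is a minimal sketch of how these clauses could be laid out (the actual code is the one rendered from ../src/omp_iloop.c below; the count_sort body in this sketch assumes the usual comparison-counting formulation over an int array):

#include <stdlib.h>
#include <string.h>
#include <omp.h>

void count_sort(int a[], int n) {
    int i, j, count;
    int *temp = malloc(n * sizeof(int));

    /* i, j and count are per-iteration scratch -> private;
       a, temp and n are shared by all threads. */
    #pragma omp parallel for private(i, j, count) shared(a, temp, n)
    for (i = 0; i < n; i++) {
        count = 0;
        for (j = 0; j < n; j++)           /* count elements that sort before a[i] */
            if (a[j] < a[i] || (a[j] == a[i] && j < i))
                count++;
        temp[count] = a[i];               /* each i writes a distinct position: no race */
    }

    memcpy(a, temp, n * sizeof(int));
    free(temp);
}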
In [30]:
!pygmentize -l c -f html -O full -o ./iloop.html ../src/omp_iloop.c
HTML(open('iloop.html', 'r').read())
Out[30]:
In the j-loop case $j$ should be private. $\text{count}$ needs to be shared, since every thread may update it. This also means a thread has to wait if another thread is already updating $\text{count}$.
Since $a$ is read-only but can be big (making it private would add substantial overhead), I made it shared.
$i$ and $n$ are constants in this context, but they need to be shared; otherwise each thread would get its own uninitialized copy and produce errors.
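A minimal sketch of the corresponding loop nest (same count_sort body as in the i-loop sketch above, but with the pragma moved to the inner loop; the atomic update is an assumption here, a critical section would behave the same way):

for (i = 0; i < n; i++) {
    int count = 0;                        /* declared before the parallel region -> shared inside it */
    #pragma omp parallel for private(j) shared(a, count, i, n)
    for (j = 0; j < n; j++) {
        if (a[j] < a[i] || (a[j] == a[i] && j < i)) {
            #pragma omp atomic            /* every thread updates count, so it must be synchronized */
            count++;
        }
    }
    temp[count] = a[i];
}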
In [3]:
!pygmentize -l c -f html -O full -o ./jloop.html ../src/omp_jloop.c
HTML(open('jloop.html', 'r').read())
Out[3]:
I decided to use $n = 10, 100, 1000, 10000, 100000$, and $1, 2, 4, 8, 10, 16, 20$ threads.
For the Xeon Phi I used $1, 2, 4, 8, 16, 32, 64, 128, 240$ threads.
In [4]:
VECTOR_SIZE = [10, 100, 1000, 10000, 100000]
THREADS = [1, 2, 4, 8, 10, 16, 20]
PHI_THREADS = [1, 2, 4, 8, 16, 32, 64, 128, 240]
This is my implementation using qsort.
To keep it consistent with the i-loop and j-loop OpenMP implementations (and so share the analysis code), I wrapped the qsort call inside another function, $\text{qsort\_lib}$.
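A minimal sketch of what this wrapper could look like (the real file is rendered below; the int element type and the comparison logic are assumptions):

#include <stdlib.h>

int cmpfunc(const void *p, const void *q) {
    int x = *(const int *)p, y = *(const int *)q;
    return (x > y) - (x < y);             /* ascending order, overflow-safe */
}

void qsort_lib(int a[], int n) {
    /* a single wrapper function so TAU attributes the whole sort to one
       profile entry, mirroring count_sort in the OpenMP versions */
    qsort(a, n, sizeof(int), cmpfunc);
}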
In [5]:
!pygmentize -l c -f html -O full -o ./qsort.html ../src/qsort.c
HTML(open('qsort.html', 'r').read())
Out[5]:
These commands build the executable, submit it to run on the HPCC job system, and finally extract the raw TAU info into a text file. They are all implemented in the Makefile.
In [6]:
!cd .. && make exec/qsort-GNU
!cd .. && make workdir/qsort.pbs
!cd .. && make TAU_PROFILES_QSORT
In [7]:
data = prepare_panel('qsort', threads=[1])
pn_qsort = pd.Panel4D(data)
# dimensions: pn[vector size, threads, function name, metric]
pn_qsort.ix[:,:,'qsort_lib','Inclusive usec/call']
Out[7]:
In [9]:
pn_qsort.ix[:,:,'qsort_lib','Inclusive usec/call'].T.plot(marker='o', logy=True, logx=True)
plt.xlabel('Number of threads')
plt.ylabel('Microseconds')
Out[9]:
In [12]:
!cd .. && make exec/omp_iloop-GNU
!cd .. && make workdir/omp_iloop.pbs
!cd .. && make TAU_PROFILES_ILOOP
In [13]:
data = prepare_panel('omp_iloop')
pn_iloop = pd.Panel4D(data)
# dimensions: pn[vector size, threads, function name, metric]
pn_iloop.ix[:,:,'count_sort','Inclusive usec/call']
Out[13]:
In [14]:
pn_iloop.ix[:,:,'count_sort','Inclusive usec/call'].plot(marker='o', logx=True, logy=True)
plt.xlabel('Number of threads')
plt.ylabel('Microseconds')
Out[14]:
In [15]:
speedup_iloop = pn_iloop.ix[:,1,'count_sort','Inclusive usec/call'] / pn_iloop.ix[:,:,'count_sort','Inclusive usec/call']
speedup_iloop
Out[15]:
In [16]:
speedup_iloop.plot()
plt.plot(THREADS, THREADS, '--', label='Linear speedup')
plt.legend(loc='best')
plt.xlabel('Number of threads')
plt.ylabel('Speedup')
plt.title("$i$-loop parallel for")
Out[16]:
As expected, for small vector sizes synchronization overhead and the small workload make the function slower than the sequential case.
For bigger vector sizes there is a significant speedup, especially in the $n=100000$ case.
For $t=20, v=10000$ there was probably competition for resources with other processes running on the node (which is why more repetitions and mean/standard deviation statistics are needed).
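For reference, the speedup computed above is the inclusive time of the single-thread run divided by the inclusive time of the $t$-thread run for the same vector size,

$$S(n, t) = \frac{T_1(n)}{T_t(n)},$$

so the dashed $S = t$ line in the plot marks perfect linear speedup.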
In [17]:
speedup_iloop_qsort = pn_qsort.ix[:,1,'qsort_lib','Inclusive usec/call'] / pn_iloop.ix[:,:,'count_sort','Inclusive usec/call']
speedup_iloop_qsort
Out[17]:
In [18]:
speedup_iloop_qsort.plot()
plt.legend(loc='best')
plt.xlabel('Number of threads')
plt.ylabel('Speedup')
plt.title("$i$-loop parallel for")
Out[18]:
Using qsort as the base case we can see how a better algorithm matters more in the long run. Since qsort is $O(n \log n)$ in the average case, the parallelized $O(n^2)$ algorithm is still slower.
Despite that, for small vector sizes count_sort was actually faster in the sequential case.
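As a rough order-of-magnitude check (ignoring constant factors) for the largest vector size, $n = 100000$:

$$n^2 = 10^{10} \qquad \text{vs.} \qquad n \log_2 n \approx 10^5 \times 17 \approx 1.7 \times 10^6,$$

a gap of more than three orders of magnitude, far more than the at most $20\times$ that the available threads could recover.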
In [24]:
!cd .. && make exec/omp_jloop-GNU
!cd .. && make workdir/omp_jloop.pbs
!cd .. && make TAU_PROFILES_JLOOP
In [26]:
data = prepare_panel('omp_jloop')
pn_jloop = pd.Panel4D(data)
# dimensions: pn[vector size, threads, function name, metric]
pn_jloop.ix[:,:,'count_sort','Inclusive usec/call']
Out[26]:
In [27]:
pn_jloop.ix[:,:,'count_sort','Inclusive usec/call'].plot(marker='o', logx=True, logy=True)
plt.xlabel('Number of threads')
plt.ylabel('Microseconds')
Out[27]:
In [28]:
speedup_jloop = pn_jloop.ix[:,1,'count_sort','Inclusive usec/call'] / pn_jloop.ix[:,:,'count_sort','Inclusive usec/call']
speedup_jloop
Out[28]:
In [29]:
speedup_jloop.plot()
plt.legend(loc='best')
plt.xlabel('Number of threads')
plt.ylabel('Speedup')
plt.title("$j$-loop parallel for")
Out[29]:
For the j-loop case adding more threads just made the function slower, and this is expected because $\text{count}$ needs to be synchronized and other threads need to wait to update it.
In [37]:
speedup_jloop_qsort = pn_qsort.ix[:,1,'qsort_lib','Inclusive usec/call'] / pn_jloop.ix[:,:,'count_sort','Inclusive usec/call']
speedup_jloop_qsort
Out[37]:
In [39]:
speedup_jloop_qsort.plot()
plt.legend(loc='best')
plt.xlabel('Number of threads')
plt.ylabel('Speedup')
plt.title("$i$-loop parallel for")
Out[39]:
The same conclusion holds when comparing against the qsort base case: since $\text{count}$ has to be synchronized and threads wait on each other to update it, adding more threads only makes the j-loop version slower.
For the Xeon Phi implementation of qsort I had to declare cmpfunc as 'runnable' on the MIC architecture. The only other modification is the #pragma offload around the qsort call.
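A minimal sketch of those two changes, assuming the Intel Language Extensions for Offload syntax and an int array (the actual file is rendered below):

#include <stdlib.h>

/* cmpfunc must also be compiled for the coprocessor, i.e. be 'runnable' on MIC */
__attribute__((target(mic)))
int cmpfunc(const void *p, const void *q) {
    int x = *(const int *)p, y = *(const int *)q;
    return (x > y) - (x < y);
}

void qsort_lib(int a[], int n) {
    /* ship a to the card, sort there, copy the result back */
    #pragma offload target(mic) inout(a : length(n))
    qsort(a, n, sizeof(int), cmpfunc);
}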
In [59]:
!pygmentize -l c -f html -O full -o ./mic_qsort.html ../src/mic_qsort.c
HTML(open('mic_qsort.html', 'r').read())
Out[59]:
For the i-loop Xeon Phi implementation I added the #pragma offload before the OpenMP parallel for.
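A minimal sketch of that combination (the data-transfer clauses are assumptions; the actual file is rendered below):

/* offload the whole parallel loop; a goes in, temp comes back out */
#pragma offload target(mic) in(a : length(n)) out(temp : length(n))
#pragma omp parallel for private(i, j, count) shared(a, temp, n)
for (i = 0; i < n; i++) {
    count = 0;
    for (j = 0; j < n; j++)
        if (a[j] < a[i] || (a[j] == a[i] && j < i))
            count++;
    temp[count] = a[i];
}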
In [60]:
!pygmentize -l c -f html -O full -o ./mic_iloop.html ../src/mic_iloop.c
HTML(open('mic_iloop.html', 'r').read())
Out[60]:
Since I used TAU for the other analyses, I wanted to use TAU for the Xeon Phi cases too. But HPCC didn't have it installed at first, and after it was installed the profile data wasn't generated. I contacted HPCC support but didn't get an answer in time for the homework.
It should be compute-bound, but it ends up being memory-bound because of the poor memory access pattern: for each element of the array, the previous elements need to be accessed again, and more efficient sorting algorithms (like merge sort or even quicksort) have better access patterns than count sort.
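As a back-of-the-envelope illustration of the access pattern (assuming the usual formulation in which each element is compared against roughly the whole array):

$$\text{re-reads per element} \approx n = 10^5 \qquad \text{vs.} \qquad \log_2 n \approx 17 \text{ for merge sort or quicksort},$$

so the repeated traversals dominate even though the arithmetic done per comparison is trivial.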